ABSTRACT


We define opportunistic computing environments as augmented reality
environments that attempt to minimally perturb existing work practices
and allow users to opportunistically take advantage of them. In this
paper, we present Rasa, an opportunistic computing environment for
military command and control that augments the physical objects on a
command post map by observing and understanding the users’ speech,
pen, and touch-based multimodal language. We give a thorough account
of Rasa’s underlying multiagent framework, recognition, understanding,
and multimodal integration components. Finally, we examine three
properties of language (generativity, comprehensibility, and
compositionality) that render it suitable as an augmentation scheme,
and we compare these properties to those of current tagging
technologies and approaches.
INTRODUCTION

This paper describes an opportunistic computing environment: an
augmented reality environment that attempts to minimally perturb
existing work practices and allows users to opportunistically take
advantage of existing tools and processes.

We begin by describing a common work practice we observed in a
military command post, where we found people already augmenting
physical objects. Based on these observations, we developed a set of
design constraints to support the development of an augmented reality
environment that is able to employ these preexisting augmentations.
In a recent paper, we showed that contemporary object tagging
approaches fail to meet these [...] augmentation catena [18]. Here we
discuss why language, due to its generativity, compositionality, and
comprehensibility, meets each criterion, and why that makes it an
especially attractive augmentation method. Finally, we present a
detailed description of Rasa, a system that understands the language
officers use in military command posts and updates digital command
and control systems by capturing, recognizing, and understanding this
language used to augment physical objects on a paper map.

WORK  PRACTICE 

At Ft. Leavenworth, Kansas and at other military bases, we observed
commanders and their subordinates engaging in command and control of
armed forces. The photograph in Figure 1 was taken during an
especially frenetic period in the command post.

On the left is a rear-projected SmartBoard™ and on the right is a Sun
Microsystems workstation. Several other systems are in the immediate
foreground. On each is one [...] systems. Notice that no one is using
these systems.

Figure 1. State of the art military command and control systems in
action. Photo courtesy of William Scherlis.

Figure 2. What commanders prefer. Photo courtesy of William Scherlis.

During this critical phase of work, the commander and his staff have
chosen to use a different set of tools than the ones designed for this
task. They have quite purposefully turned their backs on
computer-based tools and graphical user interfaces. Rather, we see
that they have chosen to use a large 8-foot high by 4-foot wide paper
map, arrayed with Post-it™ notes (Figure 2).

This  large  map,  one  of  the  primary  command  and  control 
tools  used  by  commanders,  has  a  two-fold  purpose:  1)  to 
depict  the  terrain,  its  occupants  (military  units  consisting 
of soldiers and machinery), their position, and capabilities; and
2) to overlay that information with a graphical rendition of the daily
plan for the force. The map is kept up-to-date by constant
communication both up and down the organizational hierarchy.

The  overlays  are  Mylar  sheets  reproduced  each  night  by 
planners.  Copies  are  shared  amongst  relevant  units.  A 
symbol  representing  a  unit’s  functional  composition  and 
size is sketched in ink on each Post-it. As unit positions arrive over
the radio, the Post-its representing these units are moved to their
respective positions on the map.

The  users  establish  associations  between  the  Post-it  notes 
and  their  real-world  entities  with  a  standardized  language 
used since Napoleon’s time that is capable of denoting
thousands of unit types.

The  officers  choose  paper  for  a  variety  of  reasons.  It  has 
extremely  high  resolution.  Moreover,  it  is  malleable, 
cheap,  and  lightweight,  and  thus  can  be  rolled  up  and 
taken  anywhere.  Additionally,  people  like  to  handle 
physical  objects  as  they  collaborate.  As  officers  debate 
the  reliability  of  sensor  reports  and  human  observation  to 
determine  the  actual  position  of  units  in  the  field,  they 
jab  at  locations  on  the  map  where  conflicts  may  arise. 
They  also  pick  up  Post-it  notes  and  hold  them  in  their 
hand  while  they  debate  a  course  of  action. 

Rasa perceives the augmentations resulting from the users’
interacting multimodally with these physical objects
by understanding this language. Consequently, users can
continue  to  employ  familiar  tools  and  procedures,  which 
in  turn  create  automatic  couplings  to  the  digital  world.  In 
the  next  section,  we  examine  other  methods  that  have 
been  used  to  augment  environments. 

Augmenting  physical  objects 

Researchers  have  used  two  methods  to  augment  physical 
objects: 

•  sensing  physical  properties  (weight  [24],  shape, 
color,  size,  location) 

•  affixing sensable properties, hereafter called simply
“tags” (bar codes [9], glyphs [19, 25], radio-frequency
identifier tags [26], etc.)

Most  researchers  will  agree  that  sensing  the  physical 
properties  of  arbitrary  objects  to  identify  them  uniquely 
is still quite difficult, except in tightly constrained
environments. The Perceptive Workbench [15] is one notable
example.  Thus,  most  unencumbering  augmented  reality 
systems  rely  on  tags  to  augment  objects. 

We  considered  each  of  these  tagging  approaches  when 
we  began  to  design  Rasa,  but  none  of  them  met  the 
physical  constraints  needed  to  augment  the  Post-it  notes. 
Moreover,  none  fulfilled  the  design  constraints  discussed 
below  and  in  detail  in  [18]. 


Minimality Constraint: Changes to the work practice must be minimal.
The system should work with users’ current tools, language, and
conventions.

Human Performance Constraint: Multiple end-users must be able to
perform augmentations.

Malleability Constraint: Because users gain information about the
real world object over time, the meaning of an augmentation should be
changeable; at a minimum, it should be incrementally so.

Human Understanding Constraint: The users must be able to perceive
and understand their own augmentations unaided by technology.
Moreover, multiple users should be able to do likewise, even if they
are neither spatially nor temporally co-present. Users must also
understand what the augmentation entails about the corresponding
objects in the real world.

Robustness Constraint: The work must be able to continue without
interruption should the system fail.

None  of  the  prior  tagging  methods  provides  a  persistent 
representation  of  the  augmentation  that  would  make  it 
robust  to  every  kind  of  system  failure  (communications, 
power,  etc.),  a  common  occurrence  in  this  environment. 
Users  have  no  way  of  deciphering  bar  codes  and  glyphs 
without  computational  aid.  If  the  tag  readers  and  other 
systems  fail,  people  can  no  longer  understand  what  the 
tag  means.  With  none  of  these  methods  is  there  a  natural 
way  for  the  end-user  to  create  the  augmentations.  Finally, 
none  of  these  tagging  methods  could  be  introduced  into 
the  work  practice  without  engendering  a  great  degree  of 
change. 

In  the  next  section,  we  discuss  how  language  overcomes 
these  deficiencies.  Moreover,  we  examine  the  properties 
of  language  that  argue  for  its  consideration  as  a  tool  for 
augmenting  physical  objects. 

LANGUAGE 

Language is generative, compositional, and comprehensible, and
depending on whether language is written or spoken, it can be
permanent or transitory, respectively. These attributes make language
a particularly suitable candidate for creating augmentations. By
“language,” we mean an arrangement of perceptible “tokens” that have
both structure and meaning. This definition is meant to subsume both
natural spoken and written languages, as well as diagrammatic
languages such as military symbology.

The military symbology language taught to all soldiers consists of
shapes to indicate friend (rectangle) or foe (diamond), a set of
lines, dots, or “X’s” to indicate unit size (squad to army), and a
large variety of symbols to indicate unit function (e.g., mechanized,
air defense, etc.) denoted by combinations of meaningful
diagrammatics. For example, an armored reconnaissance unit’s symbol
is a combination of the marks used for armor and for reconnaissance,
as shown in Figure 3. In addition, the unit’s reporting structure
(e.g., First platoon, Charlie company) is often indicated by an
abbreviation (1/C) written to the side of the symbol (see Figure 4).
In virtue of these components and its compositional nature, the
symbol language can denote thousands of units as well as their
position in the unit hierarchy. Because the language is
compositional, soldiers are able to understand and generate complex
concepts in terms of their parts. Moreover, soldiers use the language
to communicate the situation to others by arraying such symbols on
written notes or other devices (e.g., pushpins) on the map.


Since  written  language  is  permanent,  it  leaves  behind  a 
persistent trail as paper-based augmentations are incrementally
applied. Not only can we understand shared
languages  unaided,  we  can  often  recognize  the  author  of 
an  utterance.  Neither  of  these  abilities  is  resident  in  tags. 

Spoken language is convenient when the user would prefer a lack of
persistence. For example, rather than update the permanent shared
understanding conveyed by the map, spoken language is often used to
name or refer to objects. This generativity is another unique
property of language, which enables users to create references,
placeholders, metaphors, symbols, etc. Users often name entities
(e.g., “advanced guard”) while writing their symbology on the Post-it
notes, thereby making the notes a placeholder for entities in the
real world.

None of these characteristic properties of language (generativity,
compositionality, and comprehensibility) is present in tagging
approaches to augmented reality systems. One might imagine designing
a user interface to accompany the creation and reading of tags that
offered these capabilities. However, any change to these highly
learned behaviors, such as the introduction of computer interfaces,
that is not minimally disruptive to the process is likely to be
resisted.

In summary, the language used in the command post offers an ideal
means for augmenting physical objects such as Post-it notes. However,
in order to take advantage of the users’ reliance on language, a
system must be capable of understanding it. In the following section,
we present the architecture for Rasa, a system derived from our
earlier work on QuickSet [7] that enables multimodal understanding of
spoken and gestural language in such augmented environments.

DESCRIPTION  OF  USE 

Figure 4. Attack helicopter symbol called First platoon, Charlie
company.

When the user first sets up Rasa in the command center, he unrolls
his map and attaches it to a SmartBoard or other touch-sensitive
surface (see Figure 5). A paper map, or in fact any Cartesian
portrayal of the real world (e.g., photograph, drawing, etc.), can be
registered to a position in the world by tapping at two points on it
and speaking the coordinates for each. Immediately, Rasa is capable
of projecting information on the paper map, or some other display,
from its digital data sources. For example, Rasa can project unit
symbology, other map annotations, 3D models, answers to questions,
tables, etc.
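
The two-point registration just described can be illustrated with a minimal sketch. It assumes that the map's axes are aligned with those of the touch surface, so that each axis needs only a scale and an offset; the paper does not say how Rasa computes the mapping, and the names `Registration` and `register` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    """Per-axis linear map from touch-surface pixels to map coordinates."""
    scale_x: float
    offset_x: float
    scale_y: float
    offset_y: float

    def to_world(self, px, py):
        """Convert a touch position to map coordinates."""
        return (self.scale_x * px + self.offset_x,
                self.scale_y * py + self.offset_y)


def register(tap1, world1, tap2, world2):
    """Build a Registration from two taps and their spoken coordinates.

    Assumption (not from the paper): no rotation between map and surface,
    so the two taps must differ in both x and y.
    """
    (px1, py1), (px2, py2) = tap1, tap2
    (wx1, wy1), (wx2, wy2) = world1, world2
    if px1 == px2 or py1 == py2:
        raise ValueError("the two taps must differ in both x and y")
    scale_x = (wx2 - wx1) / (px2 - px1)
    scale_y = (wy2 - wy1) / (py2 - py1)
    return Registration(scale_x, wx1 - scale_x * px1,
                        scale_y, wy1 - scale_y * py1)


# Two taps with spoken coordinates (90, 90) and (100, 100); the surface's
# y axis points down, which the negative y scale absorbs.
reg = register((100, 900), (90.0, 90.0), (1100, 100), (100.0, 100.0))
x, y = reg.to_world(700, 580)  # approximately (96.0, 94.0)
```

A map that is rotated relative to the surface would need a third reference point or a similarity transform; that case is outside this sketch.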

As a user receives a radio report identifying an enemy reconnaissance
company, (1) he draws a symbol denoting the unit on a Post-it.¹
Simultaneously, he can choose to modify the object with speech. For
instance, he draws the reconnaissance company unit symbol in Figure 3
and at the same time gives the unit the name “Advanced guard” via
speech. (2) The system performs recognition of both speech and
gesture in parallel, producing multiple hypotheses. (3) These
hypotheses are parsed into meaning representations, and (4) are
submitted for integration. (5) Some time later, the user places the
Post-it on a registered map of the terrain at position 96-94. (6-8)
This gesture is recognized and parsed, then also submitted for
integration. After successful fusion of these inputs, Rasa says,
“Confirm: Enemy reconnaissance company called ‘advanced guard’ has
been sighted at nine-six, nine-four.” The user can then disconfirm
the system’s response if it is in error. If it is correct, the user
need not confirm; further action implies confirmation [17]. After
confirmation, the unit is inserted into a database, which triggers a
message to external digital systems. This example is illustrated in
the diagram in Figure 6.

The  next  section  describes  the  system  architecture  that 
makes  this  type  of  augmented,  multimodal  interaction 
possible.  Following  this  description,  we  will  examine  this 
example  of  Rasa’s  use  in  further  detail. 

ARCHITECTURE 

Rasa consists of autonomous and distributed software components that
communicate using an agent communication language in the Adaptive
Agent Architecture (AAA) [14], which is backwards compatible with the
Open Agent Architecture (OAA) [6].

Agent  Framework 

The AAA is a robust, facilitated multi-agent system architecture
specifically adapted for use with multimodal systems. A
multi-platform Java agent shell provides services that allow each
agent to interact with others in the agent architecture. The agents
can dynamically join and leave the system. They register their
capabilities with an AAA facilitator, which provides brokering and
matchmaking services to them.


¹ For simplicity, all units will be rectangular in shape.


Figure  5.  Photograph  of  Rasa. 


Rasa’s understanding of language is due to multiple recognition and
understanding agents working in parallel and feeding their results to
the multimodal integration agent. These agents and the human computer
interface agents are described below.

User  Interfaces 

With Rasa, users first draw on a pad of Post-it notes affixed to a
Cross Computing iPenPro™ digital pen tablet. A paper interface agent,
running on the computer system connected to the tablet, captures
digital ink, while the pen itself produces real ink on each note.
However, there is no computer or user interface visible, other than
the Post-its themselves.

In addition to the Post-its, map annotations can be created
multimodally and then projected onto the paper map or a separate
surface. As new units are placed on the map, a colored shadow
indicating their position and their disposition (friendly or enemy)
can be overlaid on the paper map. A table of all units present and
their combat strength can be projected next to the map as visual
confirmation of the status of each unit.² “Control measures,” such as
barbed wire, fortifications, berms, etc., can be drawn directly on the
map’s plastic overlay. The resulting military icon for that object is
projected directly onto the map.

Conceptually, Rasa’s user interfaces act as transparent interaction
surfaces on the paper. Whenever the user touches the map and the
touch-sensitive surface beneath it or uses the pen on the iPenPro
tablet, they are interacting with Rasa. As they do, messages
describing the digital ink left behind on those surfaces are sent to
the facilitator for distribution to the relevant agents.

One additional invisible interface is used, text-to-speech, which can
be enabled or disabled on command as needed. Each interaction with
Rasa produces not only visual confirmation if a projector is present,
but also audible confirmation.

² This type of table is often found next to the map tool in command
posts.

Figure 6. An example of data and process flow in Rasa’s system
architecture.

Recognizers 

Interaction with any of these paper surfaces results in ink being
processed by Rasa’s handwriting and gesture agents. Each produces
simultaneous output to be processed by the multimodal integrator and
combined as appropriate. At the same time, speech recognition is also
enabled, providing input to the integrator. These agents and their
abilities are discussed below.

Handwriting 

Paragraph’s writer-independent Calligrapher handwriting recognition
engine has been incorporated as an agent into Rasa. Like the gesture
agents described below, the handwriting agent receives input from
interactions on the paper surface in the form of digital ink. The ink
is sent from the user interfaces as individual strokes of
time-stamped, contextualized, x-y pairs with supplementary
information. The context supplied depends upon the user interface:
maps provide a context of a location on the earth, while blank paper
has no context.

Calligrapher can recognize natural letter shapes, including
cursive, printed, and mixed case. Furthermore, given
a  vocabulary  from  the  domain,  it  can  distinguish  between 
vocabulary,  non-vocabulary,  and  non-handwriting  (other 
ink-drawn  gestures).  Combining  this  ability  with  Rasa’s 
other  pen-based  recognizers,  Rasa  can  recognize  and 
understand  mixed  symbolic  and  handwritten  drawings, 
like  that  shown  in  Figure  6. 


Gesture 

Rasa’s gesture agents recognize symbolic and editing gestures, such
as points, lines, arrows, deletion, and grouping, as well as military
symbology, including unit symbols and various control measures
(barbed wire, fortification, boundaries, axes of advance, etc.), based
on a hierarchical recognition technique called Member-Team-Committee
(MTC) [27].

The MTC weights the contributions of individual recognizers based on
their empirically derived relative reliabilities, and thereby
optimizes pattern recognition robustness. It uses a
divide-and-conquer strategy, wherein the members produce local
posterior estimates that are reported to one or more “team” leaders.
The team leaders apply weighting to the scores, and pass results to
the committee, which weights the distribution of the teams’ results.
Using the MTC, the symbology recognizer can identify 200 different
military unit symbols, while achieving a better than 90% recognition
rate.
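
The member, team, and committee flow can be sketched as two levels of weighted averaging. This is only an illustration under stated assumptions: the published MTC [27] derives its weights empirically and may combine scores differently, and the function names, weights, and class labels below are all hypothetical.

```python
def weighted_mix(distributions, weights):
    """Weighted average of posterior estimates (dicts of class -> score)."""
    total = sum(weights)
    classes = set().union(*distributions)
    return {
        c: sum(w * d.get(c, 0.0) for d, w in zip(distributions, weights)) / total
        for c in classes
    }


def member_team_committee(teams, committee_weights):
    """Combine member posteriors per team, then team results in a committee.

    `teams` is a list of (member_posteriors, member_weights) pairs.
    Linear weighting is an assumption of this sketch, not the MTC itself.
    """
    team_results = [weighted_mix(posteriors, weights)
                    for posteriors, weights in teams]
    combined = weighted_mix(team_results, committee_weights)
    # Rank hypotheses from most to least likely.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical example: two teams voting between two unit symbols.
teams = [
    ([{"recon": 0.6, "armor": 0.4}, {"recon": 0.2, "armor": 0.8}], [3, 1]),
    ([{"recon": 0.9, "armor": 0.1}], [1]),
]
ranked = member_team_committee(teams, committee_weights=[1, 1])
```

In this example the first team is undecided, and the second team's confident vote tips the committee toward "recon".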

Speech 

Rasa receives its spoken input from a microphone array attached to
the top of the SmartBoard, directly above the map, or from wireless
microphones. The speech agent uses SAPI speech recognition engines
from both Dragon Systems and Microsoft, or IBM’s Voice Type
Application Factory (VTAF), all continuous, speaker-independent
recognizers. Training can be used to increase the accuracy of both
SAPI-compliant engines. The SAPI engines use context-free grammars,
while the VTAF engine uses bigrams. VTAF is limited to one hypothesis
per spoken phrase; the two other engines produce n-best lists for
each phrase (word scores are also available). Rasa’s vocabulary is
approximately 675 words, and the grammar specifies a far greater
number of valid phrases.

The speech agent activates the recognizer when the user interacts
with paper (touch to talk) or at all times (open microphone). The
speech agent and microphone need not reside on any one computer in
the augmented environment due to the distributed nature of the
software architecture. Speech hypotheses, once generated, are
forwarded to the natural language parser with their phrase scores and
time stamps.


Parsers: producing understanding from recognition

Natural  language 

A definite-clause grammar produces typed feature structures, which
are directed acyclic graphs (DAGs) of attribute-value pairs described
more fully below, as the meaning representation. For this task, the
language consists of map-registration predicates, noun phrases that
label entities, adverbial and prepositional phrases that supply
additional information about the entity, and a variety of imperative
constructs for supplying behavior to those entities or to control
various systems.

Gesture 

The gesture parser also produces typed feature structures, based on
the list of recognition hypotheses and probability estimates supplied
by the gesture recognizer. Typically, there would be multiple
interpretations for each hypothesis. For example, a pointing gesture
has at least three meaningful interpretations: a selection is being
made, a location is being specified, or the first of many point
locations is being specified. In the next section, we will examine
how these multiple interpretations are weighed in the multimodal
integrator.

Requirements  for  Multimodal  Integration 

Rasa’s multimodal integration technology uses declarative rules to
describe how the meanings of input from speech, gesture, or other
modalities must be semantically and temporally compatible in order to
combine. This fusion architecture was preceded by the original
“Put-That-There” [2], and other approaches [5, 13, 20, 21]. However,
as we reported in [11], these prior approaches are limited in four
ways.

1.  They are generally restricted to simple deictic gestural
expressions.

2.  They are primarily driven by the spoken modality, whereas
first-class language exists in other modalities as well.

3.  They have not provided a well-understood, generally applicable
common meaning representation.

4.  They have not provided a formally well-defined declarative
mechanism for multimodal integration.

Our approach to overcoming each of these limitations supports:

•  Multiple parallel recognizers and “understanders” that produce
meaning fragments from continuous, parallel coordinated input
streams.

•  A common meaning representation: typed feature structures.

•  A general application of rule-based constraints that satisfy,
among other things, an empirically based [23], time-sensitive
grouping process.

•  A well-understood and semantically well-defined fusion algorithm
that uses declarative rules for combining compatible meaning
fragments: unification. Unification combines both complementary and
redundant information, but rules out incompatible attribute values.

•  A set of declarative multimodal grammar rules that enable parsing
and interpretation of natural human input distributed across multiple
simultaneous spatial dimensions, time, and speech.

•  An algorithm that chooses the best semantically complete, joint
interpretation of multimodal input, thus allowing one mode to
compensate for another mode’s errors [22].

In general, multimodal inputs are recognized, and then parsed,
producing meaning descriptions in the form of typed feature
structures. The integrator fuses these meanings together by
evaluating any available integration rules for the type of input
received and those partial inputs waiting in an integration buffer.
Compatible types are unified, and the candidate meaning combination
is subject to constraints. Successful unification and constraint
satisfaction results in a new set of merged feature structures. The
highest ranked semantically complete feature structure is executed.
If none are complete, they wait in the buffer for further fusion, or
contribute to the ongoing discourse as discussed below.
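
The final selection step can be sketched as follows, assuming each merged candidate is a (score, feature structure) pair and that completeness is decided by a caller-supplied predicate. The function name, the scoring, and the list of required features are hypothetical; the paper does not specify how candidates are ranked.

```python
def best_complete(candidates, is_complete):
    """Return the highest-scoring semantically complete candidate, or None.

    `candidates` is a list of (score, feature_structure) pairs. A result of
    None means that nothing is executable yet, so the caller would leave
    the candidates in the integration buffer.
    """
    complete = [c for c in candidates if is_complete(c[1])]
    if not complete:
        return None
    return max(complete, key=lambda c: c[0])


# Hypothetical completeness test: a unit needs a type, size, and location.
def has_all_unit_features(fs):
    return all(f in fs for f in ("type", "size", "location"))


candidates = [
    (0.9, {"type": "recon", "size": "company"}),  # likely but incomplete
    (0.6, {"type": "recon", "size": "company", "location": (96, 94)}),
]
winner = best_complete(candidates, has_all_unit_features)
```

Here the lower-scoring candidate wins because it is the only one that can be executed; the incomplete one would remain available for later fusion.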

Typed  Feature  Structure  Unification 

Semantic compatibility is captured via unification over typed feature
structures [3, 4]. Unification determines the consistency of two
representational structures, and if consistent, combines them into a
single feature structure. This type of unification is a
generalization of term unification in logic programming languages,
such as Prolog. However, feature structure unification differs from
term unification because features are unordered attribute-value pairs
of atoms and variables in a feature structure, rather than
positionally encoded attributes in a term.

When two feature structures are unified, a composite containing all
of the feature specifications from each component structure is
formed. Any feature common to both feature structures must have a
compatible value. If the values of a common feature are atoms, they
must be identical. If one is a variable, it becomes bound to the
value of the corresponding feature in the other feature structure. If
both are variables, they become constrained to always receive the
same value. If the values are themselves feature structures, the
unification operation is applied recursively. Importantly, feature
structure unification results in a DAG structure when more than one
value uses the same variable: whatever value is ultimately unified
with that variable will fill the value slot of all the corresponding
features.
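
The unification rules above can be sketched in a few dozen lines. This is an illustrative, untyped version that represents feature structures as Python dicts and atoms as plain values; `Var`, `find`, `unify`, and `resolve` are hypothetical names, there is no occurs check, and nothing here is taken from Rasa's actual implementation.

```python
class Var:
    """A logic variable; two variables are the same only if identical."""
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"?{self.name}"


class UnificationFailure(Exception):
    pass


UNBOUND = object()  # sentinel meaning "this variable has no value yet"


def find(term, bindings):
    """Follow variable bindings; return (last variable seen or None, value)."""
    var = None
    while isinstance(term, Var):
        var = term
        if term not in bindings:
            return var, UNBOUND
        term = bindings[term]
    return var, term


def unify(a, b, bindings):
    """Unify two terms, recording variable bindings in `bindings`.

    The result may still contain variables; call resolve() to substitute
    them. On failure, `bindings` may be partly updated and should be
    discarded by the caller.
    """
    var_a, val_a = find(a, bindings)
    var_b, val_b = find(b, bindings)
    if var_a is not None and var_a is var_b:
        return var_a
    if val_a is UNBOUND:  # variable takes on the other side's value
        bindings[var_a] = var_b if var_b is not None else val_b
        return var_a
    if val_b is UNBOUND:
        bindings[var_b] = var_a if var_a is not None else val_a
        return var_b
    if isinstance(val_a, dict) and isinstance(val_b, dict):
        merged = dict(val_a)  # composite of both structures' features
        for feature, value in val_b.items():
            if feature in val_a:  # common feature: unify recursively
                merged[feature] = unify(val_a[feature], value, bindings)
            else:
                merged[feature] = value
    elif val_a == val_b:  # identical atoms
        merged = val_a
    else:
        raise UnificationFailure(f"{val_a!r} conflicts with {val_b!r}")
    # Keep any variables pointing at the merged result, so that every
    # feature sharing the variable sees the same value (the DAG property).
    if var_a is not None:
        bindings[var_a] = merged
        if var_b is not None:
            bindings[var_b] = var_a
        return var_a
    if var_b is not None:
        bindings[var_b] = merged
        return var_b
    return merged


def resolve(term, bindings):
    """Substitute bound variables throughout a term."""
    var, value = find(term, bindings)
    if value is UNBOUND:
        return var
    if isinstance(value, dict):
        return {f: resolve(v, bindings) for f, v in value.items()}
    return value
```

For instance, unifying `{"object": {"name": "AG"}}` with `{"object": {"type": "recon", "size": "company"}}` and resolving the result yields a single `object` carrying all three features, whereas unifying `{"size": "company"}` with `{"size": "platoon"}` raises `UnificationFailure`.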

Typed feature structures are an extension of the representation,
whereby feature structures are assigned to hierarchically ordered
types. Typed feature structure unification requires pairs of feature
structures to be compatible in type (i.e., one must be in the
transitive closure of the subtype relation with respect to the
other). The result of a typed unification is the more specific
feature structure in the type hierarchy. Typed feature structure
unification is ideal for multimodal integration because it can
combine complementary or redundant input from different modes, but
rules out contradictory inputs.
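
The type-compatibility test can be sketched with a single-inheritance hierarchy stored as a child-to-parent map. The hierarchy below is invented for illustration (only `unit` appears as a type in the paper's figures), and `unify_types` is a hypothetical helper rather than part of Rasa.

```python
# Hypothetical type hierarchy: each type maps to its parent (None at the root).
PARENT = {
    "entity": None,
    "unit": "entity",
    "recon_unit": "unit",
    "armor_unit": "unit",
}


def ancestors(type_name):
    """Return the type itself followed by all of its supertypes."""
    chain = []
    while type_name is not None:
        chain.append(type_name)
        type_name = PARENT.get(type_name)
    return chain


def unify_types(t1, t2):
    """Return the more specific of two compatible types, or None.

    Two types are compatible when one is reachable from the other through
    the subtype relation (reflexively and transitively).
    """
    if t1 in ancestors(t2):
        return t2
    if t2 in ancestors(t1):
        return t1
    return None
```

With this hierarchy, `unify_types("unit", "recon_unit")` gives `"recon_unit"`, while the sibling types `"recon_unit"` and `"armor_unit"` do not unify.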

Rasa’s multimodal fusion is described fully in [10, 11]. In the
following section, we present a short example of its use in Rasa. We
will also examine an example of how Rasa supports human-computer
discourse. We then briefly demonstrate the advantage of agent
architectures in the support of human collaboration systems.


Multimodal  Fusion  in  Rasa 

To demonstrate how multimodal fusion works in Rasa, let’s return to
the example given above, in which an officer adds a new unit to
Rasa’s augmented map. Rasa responds in the following manner.


Speaking the unit’s name (“advanced guard”) generates a typed feature
structure similar to that shown in Figure 7. To name a new unit, the
name should be uttered while drawing a symbol that specifies the
remaining constituents for a unit, such as the reconnaissance [...]
recognized as a reconnaissance company and assigned the feature
structure in Figure 8.

    [ object: [ name: ‘AG’ ] unit ] create_unit

Figure 7. Typed feature structure from spoken utterance ‘Advanced
guard.’

    [ object: [ type: recon, size: company ] unit ] create_unit

Figure 8. Typed feature structure resulting from drawing recon
company.

Rasa’s fusion approach uses a multidimensional chart parser, or
multiparser, based on chart parsing techniques from natural language
processing [12]. Edges in the chart (feature structures of the
multimodal input) are processed by multimodal grammar rules.
Unification is able to ensure that the inputs contain compatible
feature structures (attribute/values, where values can again be typed
feature structures). A declarative set of temporal constraints,
developed from empirical investigation of multimodal synchronization
[23], is used. Spatial constraints are used for combining gestural
inputs, and new constraints can be declared and applied in any rule.


In general, multimodal grammar rules are productions
LHS → DTR1 DTR2; two daughter features (DTR1 and DTR2) are fused,
under the constraints given, into the left-hand side.³ The shared
variables in the rules, denoted by numbers in square brackets, must
unify appropriately with the inputs from the various modalities, as
previously described.


One of Rasa’s multimodal grammar rules, shown in
Figure 9, declares that partially specified units (dtr1 and
dtr2) can combine with other partially specified units,
so long as they are compatible in type, size, location, and
name features, and they meet the constraints. This rule is
expected to fire successfully when the user is attempting
to create a particular unit using different modalities
synchronously. Here dtr2 is a placeholder for gestural
input (note the location specification) and dtr1 for
spoken input, but this need not be the case. Figure 10
demonstrates partial application of the rule.


Any constraints are then satisfied using a Prolog
meta-interpreter. For example, the timing constraints for
this rule (the “overlap or follow” rule specification)
guarantee that the two inputs will temporally overlap or
that the gesture will precede the other input by at most
4 seconds.

³ In practice, any number of daughter features can be on
the right-hand side of a rule.
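
The “overlap or follow” timing constraint can be sketched as a simple predicate over time intervals. The Python below is only an illustration: Rasa evaluates its constraints with a Prolog meta-interpreter, and the `Interval` class, the function name, and the choice to measure the lag from the end of the gesture to the start of the other input are assumptions made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Start and end time of one input, in seconds."""
    start: float
    end: float


def overlap_or_follow(gesture, other, max_lag=4.0):
    """True if the inputs overlap, or `other` starts within `max_lag`
    seconds after the gesture ends (lag measurement is an assumption)."""
    overlaps = gesture.start <= other.end and other.start <= gesture.end
    lag = other.start - gesture.end
    follows = 0.0 <= lag <= max_lag
    return overlaps or follows


drawing = Interval(start=0.0, end=1.5)
print(overlap_or_follow(drawing, Interval(1.0, 2.0)))  # speech overlaps the drawing
print(overlap_or_follow(drawing, Interval(3.0, 4.0)))  # speech starts 1.5 s later
print(overlap_or_follow(drawing, Interval(7.0, 8.0)))  # speech starts 5.5 s later
```

Under these assumptions the first two pairs satisfy the constraint and the third does not, since its lag exceeds the 4-second window.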

Figure 10 shows that after fusion the left-hand side is
still missing a location feature for the unit specification.
In the next section, we will describe how Rasa uses
completeness criteria to notice this missing feature and
query the user for it.


Discourse  in  Rasa 

The completeness criterion and constraint work together
to allow each feature structure to carry along a rule that
specifies what features are needed to complete the
structure. In this way, during feature structure evaluation,
rules can fire that effectively instruct Rasa to formulate
subdialogues with the user to request the missing
information.

For example, Figure 11 shows the completeness criterion
for units. This feature structure captures the names of the
attributes that must have a value before the feature
structure is considered complete. In the example shown
here, the criterion stipulates that every unit must have an
object feature with values for type and size, as well as a
location feature with feature structure type point and a
coord value.

    completeness: [ object:   [ type: [1]
                                size: [2] ] unit
                    location: [ coord: [...] ] point ]

Figure 11. Completeness criterion for units.
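
A completeness check of this kind can be sketched in a few lines of Python. This is a hedged illustration rather than Rasa’s mechanism: the dictionary encoding of the criterion, the constant `UNIT_COMPLETENESS`, and the function `missing_features` are all hypothetical names introduced for the example.

```python
# Hypothetical encoding of the completeness criterion for units: each
# top-level feature maps to the attributes that must have a value.
UNIT_COMPLETENESS = {
    "object": ["type", "size"],
    "location": ["coord"],
}


def missing_features(fs, criterion=UNIT_COMPLETENESS):
    """Return dotted paths of the required attributes that `fs` lacks."""
    missing = []
    for feature, attributes in criterion.items():
        sub = fs.get(feature) or {}  # an absent feature lacks all attributes
        for attribute in attributes:
            if sub.get(attribute) is None:
                missing.append(f"{feature}.{attribute}")
    return missing


# A unit fused from speech and drawing, but not yet placed on the map.
unplaced = {"object": {"name": "AG", "type": "recon", "size": "company"}}
print(missing_features(unplaced))

# The same unit once a (made-up) map coordinate is known.
placed = dict(unplaced, location={"coord": (12.0, 34.0)})
print(missing_features(placed))
```

The list of missing paths is what a discourse rule would consult to decide which question, if any, to put to the user.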

This information can then be used by Rasa to produce
queries when one of the values is missing. After a tunable
delay, Rasa asks the user for the positions of units that
are fully specified except for the location feature. Users
respond by placing the Post-it note on the map or by
disconfirming the operation and throwing the unit away.
If the Post-it note representing the reconnaissance
company has not been placed on the map within
10 seconds, Rasa responds by saying “Where is the
reconnaissance company called ‘Advanced guard’?”
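
The delayed query described above might be sketched as follows. The flat unit record, the function `location_query`, and the idea of passing explicit timestamps are assumptions made for this illustration; only the 10-second delay and the wording of the prompt come from the text.

```python
def location_query(unit, created_at, now, delay=10.0):
    """Return a spoken prompt if the unit is still unplaced after `delay`
    seconds; return None if it has a location or the delay has not elapsed."""
    if unit.get("location") is not None:
        return None
    if now - created_at < delay:
        return None
    return f"Where is the {unit['type']} {unit['size']} called '{unit['name']}'?"


# Hypothetical record for the unit created in the running example.
unit = {"type": "reconnaissance", "size": "company",
        "name": "Advanced guard", "location": None}

print(location_query(unit, created_at=0.0, now=5.0))   # still within the delay
print(location_query(unit, created_at=0.0, now=11.0))  # delay elapsed, so ask
```

Making the delay a parameter mirrors the statement that it is tunable.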

Discourse rules can be declaratively specified in Rasa to
promote complete mixed-initiative collaborative dialogue;
this is left for future work. However, human-computer
dialogue is not the only aspect of collaboration supported
by Rasa. Collaboration among multiple human users is
also supported.

Collaboration  with  Rasa 

Because user interfaces subscribe to and produce common
messages, when they connect to the same facilitator,
they immediately become part of a collaborative session.
For instance, by subscribing to the entity-location
database messages, a digital QuickSet user interface can
be notified of changes in the locations of entities when a
unit is moved on Rasa’s paper map. Users can also couple
their interfaces to obtain tighter synchronicity. Coupled
interfaces subscribe to and produce common “ink”
messages, so that one user’s ink appears on the other
users’ maps, immediately providing a shared drawing
system.
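
The subscription pattern described here can be illustrated with a minimal in-process publish/subscribe sketch. It does not reproduce the agent framework Rasa actually uses: the `Facilitator` class, its `subscribe` and `publish` methods, the topic string, and the message fields are all hypothetical.

```python
from collections import defaultdict


class Facilitator:
    """Toy message hub: routes each published message to all subscribers
    of its topic. A stand-in for the real facilitator, for illustration."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in list(self._subscribers[topic]):
            callback(message)


facilitator = Facilitator()

# A digital interface keeps its own view of unit locations up to date.
digital_map = {}


def on_location_change(message):
    digital_map[message["unit"]] = message["coord"]


facilitator.subscribe("entity-location", on_location_change)

# The paper-map interface reports that a unit was moved (made-up coordinate).
facilitator.publish("entity-location", {"unit": "AG", "coord": (12.0, 34.0)})
print(digital_map)
```

Coupling two interfaces would amount to both subscribing to, and publishing on, a shared “ink” topic in the same way.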

DISCUSSION 

Experimental  evaluation  of  Rasa  is  underway.  We  are 
conducting  a  field  test  with  military  personnel  to  evaluate 
its  performance  against  both  paper  map  and  computer- 
based  map  systems.  These  experiments  should  validate  or 
disprove  the  arguments  for  opportunistic  computing  that 
we  have  made  here. 

Rasa and QuickSet’s multimodal fusion architecture,
based on declarative multimodal integration rules, makes
it easy to declare the rules for Rasa’s multi-interface
input. The ability to specify these rules declaratively
reduced the construction time of Rasa significantly.
Having an agent framework and several useful agents
available from the QuickSet system also fostered rapid
assembly of the first Rasa prototype.


Opportunistic computing environments are not the only
type of augmented environments to benefit from the
adaptation of multimodal language. With colleagues from
Columbia University [8], we are currently investigating
how multimodal processing can be applied to systems
that augment human senses. Specifically, we are adapting
our multimodal fusion capability to support pointing and
gesturing in virtual and augmented worlds.

RELATED  WORK  IN  PAPER-BASED  INTERFACES 

Several recent approaches have successfully augmented
paper in novel ways [1, 16, 19, 26]. However, none of
these approaches treats the existing language of work as
anything other than an annotation. These systems can
capture the augmentations, but cannot understand them.
Consequently, though they augment paper, and even
support tasks involving written language, they are unable
to take advantage of its properties as a means of
augmenting the physical objects, as Rasa has done.

CONCLUDING  REMARKS 

We have presented Rasa, an environment for
opportunistic computing, where physical objects are
computationally augmented by a system observing users’
work.

We have shown how language has properties that are
especially suited for opportunistic computing
environments. By virtue of the generativity of language,
users can create augmentations; by virtue of its
compositionality, a large set of augmentations is possible;
by virtue of its comprehensibility, users other than the
author can understand it; and by virtue of its persistence,
the augmentation remains understandable in the face of
failure.

Rasa leverages these benefits of language by
understanding the augmentations that officers in command
posts place on Post-it notes and paper maps, resulting in
the coupling of these placeholders with their digital
counterparts.

We  choose  this  approach  because  we  feel  we  have  no 
choice.  The  users  have  set  aside  their  computational  aids 
and  have  resorted  to  the  paper  tools.  Our  view  is  that 
users  should  not  have  to  choose,  but  that  we  should  begin 
to  augment  their  choice  of  tools  in  a  way  that  leaves  them 
unchanged. 